Abstract

This paper describes a new approach and a system 
SCREEN 1 for fault-tolerant speech parsing. Speech 
parsing describes the syntactic and semantic analy- 
sis of spontaneous spoken language. The general ap- 
proach is based on incremental immediate flat anal- 
ysis, learning of syntactic and semantic speech pars- 
ing, parallel integration of current hypotheses, and the 
consideration of various forms of speech related er- 
rors. The goal for this approach is to explore the par- 
allel interactions between various knowledge sources 
for learning incremental fault-tolerant speech pars- 
ing. This approach is examined in a system SCREEN 
using various hybrid connectionist techniques. Hy- 
brid connectionist techniques are examined because of 
their promising properties of inherent fault tolerance, 
learning, gradedness and parallel constraint integration.
The input for SCREEN consists of hypotheses about
recognized words of a spoken utterance potentially
analyzed by a speech system; the output consists of
hypotheses about the flat syntactic and semantic analysis
of the utterance. In this paper we focus on the general
approach, the overall architecture, and examples for
learning flat syntactic speech parsing. Unlike
most other speech language architectures, SCREEN
emphasizes an interactive rather than an autonomous
position, learning rather than encoding, flat analysis
rather than in-depth analysis, and fault-tolerant processing
of phonetic, syntactic and semantic knowledge.

Introduction and Motivation 

In the past, the analysis of spontaneous speech utterances
as syntactic and semantic case frame representations
received relatively little attention. Although
there had been some early attempts at combining speech
and language processing (Erman et al. 1980), the restricted
speech and language techniques at that time forced each
field to concentrate on developing further techniques
separately. Therefore, in the last decade there have been
primarily isolated modular attempts to build speech analyzers
(e.g., (Lee, Hon, & Reddy 1990; McClelland & Elman
1986)) or language analyzers (e.g., (Hobbs et al. 1992;
Kitano & Higuchi 1991)).

1 SCREEN stands for Symbolic Connectionist Robust
EnterprisE for Natural language

However, recent approaches attempt to integrate 
speech and language earlier to reduce the extensive 
space of acoustic, syntactic and semantic hypotheses 
(Pyka 1992; Young et al. 1989). The MINDS system
(Young et al. 1989) is a speech language system
which combines a speech recognizer (Lee, Hon, &
Reddy 1990) with expectation-driven language analysis.
The main contribution of the MINDS system is its
early integration of speech hypotheses with language 
hypotheses in order to restrict the search space for 
speech processing. On the other hand, the MINDS 
system relies heavily on hand-coded pragmatic knowl- 
edge from a single domain. 

The ASL system (e.g. (Pyka 1992)) is a speech lan- 
guage system which focused on the examination of in- 
teractions in a very general architecture. This system 
has an architecture similar to a blackboard architecture 
but without explicit control. Autonomous components 
can send and receive hypotheses, but the overall archi- 
tecture and relationships between the components are 
flexible. While the MINDS system emphasized the use 
of pragmatic knowledge for supporting speech process- 
ing, the ASL system focused rather on syntactic and se- 
mantic knowledge. The ASL system has an extremely 
flexible architecture which can avoid early mistakes in 
favoring a particular architecture. On the other hand, 
this flexibility also requires very sophisticated communication
operations for more complex interactions.

Both MINDS and ASL belong to the state-of-the-art 
architectures in speech language systems. However, in 
both systems the language knowledge is basically man- 
ually encoded and domain-dependent. Furthermore, 
currently errors like false starts, hesitations, correc- 
tions, and repetitions have only been implemented in 
a rudimentary pragmatic manner in the MINDS sys- 
tem. We designed SCREEN as a system for learning 
fault-tolerant incremental speech parsing. SCREEN 
deals with repairs (Levelt 1983), false starts, hesita- 
tions, and interjections. Since connectionist techniques 
have inherent fault tolerance and learning capabilities 
we explore these properties in a hybrid connectionist
architecture. In this hybrid connectionist architecture
we make use of learning connectionist representations 
as far as possible, but we do not rule out symbolic rep- 
resentations since they may be natural and efficient for 
some subtasks (e.g. for testing lexical equality of two 
words). 

The data we currently use come from the German 
Regensburg corpus 2 which contains dialogs at a railway 
counter (more than 48000 words). As a first step we 
used transcribed real utterances of the Regensburg cor- 
pus for SCREEN. This corpus contains a great deal of 
spoken constructions and occurring errors. In general 
we also have to deal with other errors introduced by 
the speech recognizer. However, for the purpose of this 
paper we concentrate on transcribed real speech utter- 
ances in order to illustrate the screening approach for 
speech parsing but our overall architecture SCREEN 
has the long-term goal of using speech input directly. 

In this paper we will first show the underlying prin- 
ciples of fault-tolerant speech parsing in SCREEN and 
the overall architecture. Then we will describe results 
from flat syntactic analysis with a hybrid connectionist 
architecture using spoken utterances. 

Principles of fault-tolerant speech 
parsing with SCREEN 

Our general approach is based on incremental imme- 
diate flat analysis, learning of syntactic and semantic 
speech parsing, and the consideration of various forms 
of speech related errors. The goal for this approach 
is to explore the parallel interactions between vari- 
ous knowledge sources for learning incremental speech 
parsing and to provide experimental contributions to 
the issue of architectures for speech language systems. 

Screening approach for interpretation level: 
Since speech is spontaneous and erroneous, a com- 
plete interpretation at an in-depth level will often fail 
due to violated expectations. Therefore, we pursue a 
screening approach which learns an interpretation at a 
flat level which is more accessible for erroneous speech 
parsing. In particular, the screening approach struc- 
tures utterances at the phrase group level. 

Previous work towards this screening approach has 
been described as scanning understanding in SCAN 
(Wermter 1992). The scanning understanding primar- 
ily focused on phrase processing while our screening 
approach goes further by integrating and extending 
speech properties into a new system SCREEN for un- 
restricted robust spontaneous language processing. 

Learning speech parsing: The analysis of an ut- 
terance as syntactic and semantic case frame represen- 
tations is among the most important steps for language 
understanding. However, in addition to semantic and 
syntactic understanding per se, there are two central 
aspects: learning and speech interaction. We examine
to what extent hybrid connectionist techniques can be
used for learning and integrating semantic and syntactic
case frame representations for speech utterances.

2 For clarity the illustrated examples are shown in their
English translation.

Dealing with errors: For building a speech lan- 
guage system we have to consider two main sources 
of errors: errors at the speech level and errors at the 
language level. Within a real speech system, errors are 
based on incomplete or noisy input so that many incor- 
rect words are detected. On the other hand, even un- 
der the assumption that a speech recognizer comes up 
with the correct word interpretations for an utterance, 
there are errors at the language level like repairs, rep- 
etitions, interjections and partially incomplete phrases 
and sentences (e.g., telegraphic language). 

SCREEN: A system for fault-tolerant 
speech parsing 

SCREEN has a parallel architecture with many indi- 
vidual modules which communicate interactively and 
in parallel similar to message passing systems. There is 
no central control; rather messages about incremental 
hypotheses at the current time are sent between spec- 
ified modules in order to finally provide an incremen- 
tal syntactic and semantic interpretation for a speech 
utterance. For the realization, we use hybrid connec- 
tionist techniques. That is, we integrate connectionist 
representations where they can be used directly and 
efficiently, but we do not rule out the use of other 
symbolic or stochastic representations. Connection- 
ist techniques are examined because of their favorable 
properties of inherent fault tolerance, learning, grad- 
edness, and parallel constraint integration. Therefore, 
SCREEN is not only an approach to examine fault- 
tolerant speech parsing but also to test the extent to 
which current connectionist techniques can be pushed 
for building a real-world complex speech language sys- 
tem. 

An overview 

Figure 1 shows an abstract overview of the
SCREEN architecture. There are basically five parts 
where each part consists of several modules. Each 
module can have a symbolic program and a connec- 
tionist network. The description of SCREEN as five 
parts follows its main functionalities but does not sug- 
gest a fixed hierarchical architecture. Rather, the mod- 
ules in the five parts work in parallel and can exchange 
messages directly. 
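
The message-passing organization described above might be pictured roughly as follows. This is a minimal sketch and not the SCREEN implementation: the class `Module`, its methods, the toy lexicon, and the category labels are hypothetical, and the synchronous calls only stand in for what is described here as parallel communication.

```python
class Module:
    """A module with a symbolic interface around some processing function.

    `process` may be a plain symbolic function or a wrapper around a
    trained network; from the outside both look the same (assumption).
    """

    def __init__(self, name, process):
        self.name = name
        self.process = process
        self.receivers = []   # modules that get this module's hypotheses
        self.outbox = []      # hypotheses produced so far, kept for inspection

    def connect(self, receiver):
        self.receivers.append(receiver)

    def receive(self, hypothesis):
        result = self.process(hypothesis)
        self.outbox.append(result)
        # No central controller: each module forwards directly to its receivers.
        for receiver in self.receivers:
            receiver.receive(result)


# Hypothetical toy lexicon; the real system learns category assignment.
TOY_LEXICON = {"yeah": "ADVERB", "i": "PRONOUN"}

category = Module(
    "category", lambda word: (word, TOY_LEXICON.get(word.lower(), "UNKNOWN"))
)
correction = Module(
    "correction", lambda hyp: {"word": hyp[0], "category": hyp[1], "error": False}
)
category.connect(correction)

for word in ["Yeah", "I"]:
    category.receive(word)

print(correction.outbox)
```

Each module only knows the modules it sends to, so adding or removing a connection does not require changing any global control code.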

The speech interface part receives input from a 
speech recognizer as word hypotheses and provides an 
analysis of the syntactic and semantic plausibility of 
the recognized words. This analysis can be used by 
the speech recognizer for further speech analysis and by 
the subsequent language parts for filtering only impor- 
tant plausible speech hypotheses for further language 
analysis. The category part receives words of an ut- 
terance and provides basic syntactic, basic semantic as 
well as abstract syntactic and abstract semantic categories.
The correction part receives knowledge about
words and phrases as well as their categories and pro- 
vides the knowledge about the occurrence of a certain 
error, like a repair or repetition. The subclause part 
is responsible for the detection of subclause borders in 
order to distinguish different subclauses. Finally, the 
case frame part is responsible for the overall interpre- 
tation of the utterance. This part receives knowledge 
about abstract syntactic and semantic categories of a 
phrase and provides the integrated interpretation. 

A more detailed overview of SCREEN 

Although we cannot describe all the hybrid connectionist
modules in SCREEN due to space restrictions,
we illustrate the overall architecture and some exam- 
ples for individual modules (see figure 2). We focus 
here only on the category part, the correction part, 
and the case frame part, and within these parts we will 
mainly focus on syntactic processing. The arrows illus- 
trate incremental parallel flow of syntactic/semantic 
hypotheses. All modules in the same part in figure 2 
are able to work in parallel while the processing of an 
utterance is incremental. While the modules in the 
correction part analyze a certain word x the modules 
in the category part are able to analyze the next word 
x+1 and so on. 
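
The overlap between parts (one part working on word x while another already works on word x+1) can be illustrated with a small scheduling sketch. It rests on assumptions that are not from the paper: the function name `pipeline_schedule`, the three stage names, and the idea that every stage needs exactly one time step per word are all simplifications chosen for illustration.

```python
def pipeline_schedule(words, stages):
    """Return, for each time step, which word each stage is working on.

    Assumes (hypothetically) that every stage takes one time step per
    word and hands its result to the next stage afterwards.
    """
    schedule = []
    n_steps = len(words) + len(stages) - 1
    for t in range(n_steps):
        step = {}
        for s, stage in enumerate(stages):
            i = t - s  # index of the word this stage sees at time t
            if 0 <= i < len(words):
                step[stage] = words[i]
        schedule.append(step)
    return schedule


# Illustrative word sequence, not a complete utterance from the corpus.
stages = ["category", "correction", "case_frame"]
words = ["Yeah", "I", "from", "Regensburg"]
for t, step in enumerate(pipeline_schedule(words, stages)):
    print(t, step)
```

At time step 1 the category stage already sees "I" while the correction stage is still working on "Yeah".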

The category part consists of the modules for dis- 
ambiguating basic categories and determining abstract 
categories. The module BAS-SYN-DIS (BAS-SEM- 
DIS) disambiguates syntactic (semantic) basic cate- 
gories. SYN-PHR-START (SEM-PHR-START) deter- 
mines the start of a new syntactic (semantic) phrase 
group. The assignment of abstract syntactic (seman- 
tic) categories is performed by the module ABS-SYN- 
CAT (ABS-SEM-CAT). 

The goal of the correction part is to detect errors at
a sub-word level, word level, or phrase group level.
At the sub-word level the module PAUSE? checks if 
a current input is a pause, INTERJECTION? checks 
whether it is an interjection or unknown phonetic in- 
put. At the word level LEX-WORD-EQ? checks if the 
current word is lexically equal to the previous word 
and BAS-SYN-EQ? (BAS-SEM-EQ?) if it is syntacti- 
cally (semantically) equal to the previous word. The 
modules at the phrase level are similar to the mod- 
ules at the word level. LEX-START-EQ? checks if the 
lexical start of two phrases is equal. ABS-SYN-EQ? 
(ABS-SEM-EQ?) checks if the abstract syntactic (se- 
mantic) category of a current phrase group is equal to 
the category of the previous phrase group. The out- 
put of the modules of the correction part described 
so far is used in the error testing modules PAUSE- 
ERROR?, WORD-ERROR?, and PHRASE-ERROR?. 
PAUSE-ERROR? checks if a pause, interjection, or un- 
known phonetic input occurred, and WORD-ERROR? 
(PHRASE-ERROR?) determines if there is evidence 
for a repair at the word level (phrase group level). 
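
The sub-word and word level tests of the correction part lend themselves to a short symbolic sketch. The function names below mirror the module names, but the bodies are assumptions: the interjection list is a hypothetical mini-lexicon, and combining lexical and syntactic equality by a simple conjunction in `word_error` is only one plausible way to gather evidence for a word-level repair.

```python
PAUSE_TOKENS = {"."}            # pauses are written as "." in the examples
INTERJECTIONS = {"eh", "uh"}    # hypothetical mini-lexicon


def is_pause(token):
    """PAUSE?: is the current input a pause?"""
    return token in PAUSE_TOKENS


def is_interjection(token):
    """INTERJECTION?: is the current input an interjection?"""
    return token.lower() in INTERJECTIONS


def pause_error(token):
    """PAUSE-ERROR?: did a pause or interjection occur?"""
    return is_pause(token) or is_interjection(token)


def lex_word_eq(prev_word, cur_word):
    """LEX-WORD-EQ?: lexical equality with the previous word."""
    return prev_word.lower() == cur_word.lower()


def bas_syn_eq(prev_cat, cur_cat):
    """BAS-SYN-EQ?: equality of basic syntactic categories."""
    return prev_cat == cur_cat


def word_error(prev, cur):
    """WORD-ERROR?: evidence for a repair at the word level.

    `prev` and `cur` are (word, basic_syntactic_category) pairs.
    Assumption: both equality tests must agree.
    """
    return lex_word_eq(prev[0], cur[0]) and bas_syn_eq(prev[1], cur[1])


print(pause_error("."), pause_error("Monday"))
print(word_error(("to", "PREP"), ("to", "PREP")))
```

Lexical equality is a natural candidate for a purely symbolic test, whereas the category inputs to `bas_syn_eq` would come from the learned category modules.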

In the case frame part a frame is filled correspond- 
ing to the syntactic and semantic categories of con- 
stituents. The module SLOT-FINDING is used to 
find the appropriate slot for a current phrase group. 
SLOT-ERROR? tests if the proposed slot is possible 
based on the compatibility of abstract syntactic and 
semantic categories for a current phrase group. VERB- 
ERROR? checks if new frames have to be generated. 
INTERPRETATION is needed to convert the internal 
word-by-word message structure of SCREEN to a more
structured representation useful for further high level 
processing. 

For illustration we focus on just a few modules for 
flat syntactic parsing. The interface of a module is rep- 
resented symbolically, the learning part of a module is 
supported by a connectionist network. While not all 
modules have to contain connectionist networks they 
will be used as far as possible for automatic knowl- 
edge extraction. For illustrating the learned perfor- 
mance of some modules of SCREEN, table 1 shows 
three modules with a simple recurrent network SRN 
(Elman 1990), the number of units in the input I, hid- 
den H, and output O layer. Training (testing) was per- 
formed with 37 (58) utterances with 394 (823) words. 
We used the training instances (words) based on the 
complete real world utterances including the errors. 
Assuming that the data contain more regular than erroneous
language, the network will have picked up the general
regularities even though it has been trained with the
erroneous real-world data. For instance, 99% (93%) of
the basic syntactic categories of the training (test) set
could be assigned correctly (see table 1). The last
row describes the combined overall performance of the
modules BAS-SYN-DIS and ABS-SYN-CAT; a combined output
is counted as correct only if both SRN networks provide
the desired category with maximum output activation.
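
For readers unfamiliar with simple recurrent networks (Elman 1990), the following sketch shows only the forward step: the hidden layer receives the current input together with a copy of its own previous activation. Everything concrete here is an assumption for illustration: the class name, the layer sizes, the random untrained weights, the zero initial context, and the omission of bias terms and of training.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SimpleRecurrentNetwork:
    """Forward pass of an Elman-style network (no training, no biases)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_in = init(n_hidden, n_in)        # input -> hidden
        self.w_ctx = init(n_hidden, n_hidden)   # context -> hidden
        self.w_out = init(n_out, n_hidden)      # hidden -> output
        self.context = [0.0] * n_hidden         # copy of previous hidden layer

    def reset(self):
        """Clear the context, e.g. at the start of a new utterance."""
        self.context = [0.0] * len(self.context)

    def step(self, x):
        net = [a + b for a, b in zip(matvec(self.w_in, x),
                                     matvec(self.w_ctx, self.context))]
        hidden = [sigmoid(z) for z in net]
        self.context = hidden  # remembered for the next word
        return [sigmoid(z) for z in matvec(self.w_out, hidden)]


net = SimpleRecurrentNetwork(n_in=3, n_hidden=4, n_out=2)  # sizes are hypothetical
first = net.step([1.0, 0.0, 0.0])
second = net.step([1.0, 0.0, 0.0])
print(first)
print(second)
```

Feeding the same input twice gives different outputs because the context layer carries information about the preceding input, which is what lets such a network use the left context of a word.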

Figure 2: SCREEN: some modules of the category, correction, and case frame parts

Table 1: Performance of some modules (table body not recoverable; the only surviving entry is 93%)
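
The combined criterion for BAS-SYN-DIS and ABS-SYN-CAT can be stated compactly in code. This is a sketch under assumptions: the function names, the representation of network outputs as lists of activations, and the toy numbers are invented for illustration and are not data from the experiments.

```python
def argmax(activations):
    """Index of the output unit with maximum activation."""
    return max(range(len(activations)), key=lambda i: activations[i])


def combined_accuracy(bas_outputs, abs_outputs, bas_targets, abs_targets):
    """Fraction of words for which both networks select the desired category."""
    correct = 0
    for b_out, a_out, b_t, a_t in zip(bas_outputs, abs_outputs,
                                      bas_targets, abs_targets):
        if argmax(b_out) == b_t and argmax(a_out) == a_t:
            correct += 1
    return correct / len(bas_targets)


# Toy activations for three words and two categories per network (invented).
bas_out = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
abs_out = [[0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]
bas_tgt = [1, 0, 0]
abs_tgt = [0, 1, 0]
print(combined_accuracy(bas_out, abs_out, bas_tgt, abs_tgt))
```

Because a word counts only when both networks are right, the combined figure can never exceed the accuracy of either network alone.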



An example for speech parsing 

In this section we describe the incremental flat syntac- 
tic processing using two real transcribed utterances in 
SCREEN. The first sentence in figure 3 does not con- 
tain a repair, while the second in figure 4 does. The 
first sentence starts with the word "Yeah" which is 
classified as adverb by the module BAS-SYN-DIS and 
as part of modus group 3 by ABS-SYN-CAT. At the 
beginning of an utterance SYN-PHR-START classifies 
a word as start of a new phrase group. The second 
word "I" is classified as a pronoun, is part of a noun 
group, and starts a new phrase group. The compar- 
ison of the first word (resp. first phrase group) and 
second word (resp. second phrase group) does not 
result in any hints for a pause, word, or phrase error.
Later in the utterance the ABS-SYN-EQ? module
finds that the two syntactic phrase groups "from Re- 
gensburg" and "to Dortmund" are syntactically equal. 
But syntactic equality of two phrase groups alone is
too weak to determine a phrase error since other modules
(LEX-START-EQ? and ABS-SEM-EQ?) suggest
that these two phrase groups are different with respect
to their start and abstract semantic categories. When
the pause "." occurs the module PAUSE-ERROR? is
triggered and the pause is deleted.

3 interrogative pronouns and confirmation words

For this first utterance the analysis has been rather 
straightforward while in the next utterance (see fig- 
ure 4) we describe a more difficult example with er- 
ror corrections. PAUSE-ERROR? is responsible for 
deleting pauses, interjections, and phonetic material. 
BAS-SYN-DIS classifies almost all interjections and 
phonetic material correctly. Only "[u]" is misclassified 
as adverb rather than interjection in BAS-SYN-DIS. 
PAUSE-ERROR? does not use this adverb informa- 
tion but only the output of PAUSE? and INTERJEC- 
TION?. As the module PAUSE-ERROR? determines 
these errors, interjections and pauses are deleted incre- 
mentally so that the phrase groups "at Monday" and 
"at Monday" follow each other directly. Since both 
groups are prepositional groups and since they have 
the same lexical start the modules LEX-START-EQ?
and ABS-SYN-EQ? trigger PHRASE-ERROR?.
Therefore the first phrase group "at Monday ..." is re- 
placed by just "at Monday". Similarly, other types of 
repairs (e.g. "at Monday" replaced by "at Tuesday", 
"in the morning") will be dealt with in the future. 
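
The phrase-level repair walked through above can be sketched as a single incremental pass. This is an illustration under assumptions, not the SCREEN code: the `PhraseGroup` class, the label "PG" for a prepositional group, and the rule that lexical start equality together with syntactic equality triggers a repair are simplifications, and pauses and interjections are assumed to arrive as plain strings already flagged for deletion.

```python
from dataclasses import dataclass


@dataclass
class PhraseGroup:
    words: tuple   # surface words of the phrase group
    abs_syn: str   # abstract syntactic category (hypothetical label)


def lex_start_eq(prev, cur):
    """LEX-START-EQ?: do both phrase groups start with the same word?"""
    return prev.words[0].lower() == cur.words[0].lower()


def abs_syn_eq(prev, cur):
    """ABS-SYN-EQ?: same abstract syntactic category?"""
    return prev.abs_syn == cur.abs_syn


def phrase_error(prev, cur):
    """PHRASE-ERROR?: syntactic equality alone is too weak, so the
    lexical start must match as well (assumed combination rule)."""
    return lex_start_eq(prev, cur) and abs_syn_eq(prev, cur)


def repair(items):
    """Process items left to right; strings stand for pauses/interjections."""
    kept = []
    for item in items:
        if isinstance(item, str):
            continue                      # deleted incrementally
        if kept and phrase_error(kept[-1], item):
            kept[-1] = item               # later phrase replaces the earlier one
        else:
            kept.append(item)
    return kept


utterance = [
    PhraseGroup(("at", "Monday"), "PG"),
    ".", "eh",
    PhraseGroup(("at", "Monday"), "PG"),
]
print([" ".join(p.words) for p in repair(utterance)])

no_repair = [
    PhraseGroup(("from", "Regensburg"), "PG"),
    PhraseGroup(("to", "Dortmund"), "PG"),
]
print([" ".join(p.words) for p in repair(no_repair)])
```

Only directly adjacent phrase groups are compared, so a repair separated by other material is not detected by this sketch.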

Figure 3: Syntax part of a sample parse of a sentence

Figure 4: Sentence with corrections

Overall functionality and performance

SCREEN provides a fault-tolerant interpretation of a
potentially faulty utterance. The words of the fault-tolerant
interpretation of the faulty utterance have
been underlined in order to illustrate this functionality
in figures 3 and 4. Currently corrections occur
most reliably for interjections, pauses, unknown words, 
and syntactic repairs with lexical equality of phrase 
starts (as "at Monday" and "at Monday morning" in 
figure 4). On the other hand, an example for a cur- 
rently existing undesired interpretation is "eh . in the 
morning at ten . in any case not after . not before 
nine" . In this case "after" should be replaced by "be- 
fore nine". However, these two prepositional phrases 
do not follow each other directly but there is an addi- 
tional separating "not" . Currently SCREEN can only 
deal with phrase repairs which follow each other di- 
rectly since such repairs occur much more often (Lev- 
elt 1983). However, considering interjections, pauses, 
unknown input, and simple forms of syntactically de- 
tectable repairs in our 95 utterances we currently reach 
a desired overall interpretation of 93%. 



Discussion 

We have described a screening approach to fault- 
tolerant speech parsing based on flat analysis. A 
screening approach can particularly support learning 
and robustness, which are properties that previous 
approaches did not emphasize (Young et al. 1989). 
The use of flat representations should stimulate fur- 
ther discussion since, in contrast to more traditional 
speech language systems which used highly structural 
hand-coded parsers, we use less structure but support 
fault tolerance and learning better. Therefore, speech 
parsers based on a screening flat analysis, learning, and 
fault tolerance should be more scalable, adaptive and 
more domain-independent. 

Our approach contributes both to general architectures
for speech parsing and to the hybrid connectionist
techniques being used. With respect to
the architecture we suggest a modular but interactive
parallel architecture where modules exchange messages
about incremental hypotheses without a particular
control interpreter. With respect to the techniques
we proposed the use of hybrid connectionist representations.
While certain subtasks (like the symbolic equality
detection of incorrectly repeated words) can be realized
best using symbolic techniques, there are other
subtasks with incompletely known functionality where
fault-tolerant connectionist learning is advantageous.

The work most closely related to ours is the connectionist
PARSEC parser for conference registrations
(Jain 1992), the hybrid connectionist JANUS speech 
translation system (Waibel et al. 1992), and the hybrid 
connectionist SCAN system for general phrase analysis 
(Wermter 1992). In general, connectionist techniques 
in PARSEC, JANUS, SCAN and SCREEN particu- 
larly support learning necessary knowledge where pos- 
sible. However, SCREEN focuses more on exploring 
interactive parallel architectures and more on model- 
ing fault tolerance. 

Currently, the overall architecture as well as all the 
syntactic modules in SCREEN have been fully im- 
plemented, trained, and tested for a corpus of utter- 
ances with 1200 words. Although the overall SCREEN 
project is at an intermediate stage we believe the new 
architecture and the finished syntactic modules con- 
tribute substantially to new fault-tolerant learning ar- 
chitectures for speech language systems. Further work 
will focus on additional semantic modules for fault- 
tolerant case-role assignment and the top down inter- 
actions to speech modules in order to reduce the search 
space of speech hypotheses. 

Conclusion 

We have described the architecture and implementa- 
tion of a new speech parser which has a number of in- 
novative properties: the speech parser learns, it is par- 
allel and fault-tolerant, and it directly integrates incre- 
mental processing from speech into language processing 
using flat analysis. We have illustrated the process- 
ing in SCREEN with flat syntactic analysis, but in a 
similar way we are currently pursuing a flat semantic 
analysis. On the one hand, flat analysis can provide 
a parallel shallow processing in preparation for a more 
in-depth analysis for high-level dialog understanding 
and inferencing. On the other hand, flat analysis can 
potentially provide necessary restrictions for reducing 
the vast search space of word hypotheses of speech rec- 
ognizers. Therefore, learned flat analysis in a screening 
approach has the potential to provide a new impor- 
tant intermediate link in between in-depth processing 
of complete dialogs and shallow processing of speech 
signals. 